Clustering and classification are visual ways of exploring statistical data. In clustering, the aim is to group a set of objects in such a way that objects in the same group, i.e. cluster, are more similar (in some sense or another) to each other than to those in other groups clusters.
This week I am using a built-in dataset, namely the Boston dataset which contains housing information in the Boston Mass are. The data has been collected by the U.S Census Service and it is also available online.
The Boston dataset consists of just 506 observations and there are 14 variables. Let us take a closer look at what those variables are.
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The variable names are not exactly self-explanatory. Because our analysis will again depend upon them, I will briefly explain what they are.
| Variable | Description |
|---|---|
| crim | per capita crime rate by town |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town |
| chas | Charles River dummy variable (1 if tract bounds river; 0 otherwise) |
| nox | nitric oxides concentration (parts per 10 million) |
| rm | average number of rooms per dwelling |
| age | proportion of owner-occupied units built prior to 1940 |
| dis | weighted distances to five Boston employment centres |
| rad | index of accessibility to radial highways |
| tax | full-value property-tax rate per $10,000 |
| ptratio | pupil-teacher ratio by town |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town |
| lstat | % lower status of the population |
| medv | median value of owner-occupied homes in $1000’s |
Thus, the Boston dataset consists of a variety of variables: demographic, economic, and environmental factors as well as safety.
Let us next take a graphical tour of the data. In the figure below, each variable is plotted against the other variables.
Here we already see patterns emerging, such as * accumulation close to the edges and corners of the box - e.g. age/lstat
* diagonal shapes - e.g. nox/dis
* round shapes close to one of the corners - e.g. rad/tax
* binary positionas in all pairs - e.g chas and rad
SUMMARIES OF VARIABLES - given below with summary of scaled
Correlation matrix
DESCRIBE AND INTERPRET THE OUTPUTS (COMMENT ON DISTRIBUTIONS OF THE VARIABLES & RELATIONSHIPS BETWEEN THEM)
4 Standardization standardize dataset + summary:
## crim zn indus chas
## Min. : 0.0 Min. : 0.0 Min. : 0.46 Min. :0.000
## 1st Qu.: 0.1 1st Qu.: 0.0 1st Qu.: 5.19 1st Qu.:0.000
## Median : 0.3 Median : 0.0 Median : 9.69 Median :0.000
## Mean : 3.6 Mean : 11.4 Mean :11.14 Mean :0.069
## 3rd Qu.: 3.7 3rd Qu.: 12.5 3rd Qu.:18.10 3rd Qu.:0.000
## Max. :89.0 Max. :100.0 Max. :27.74 Max. :1.000
## nox rm age dis
## Min. :0.385 Min. :3.56 Min. : 2.9 Min. : 1.13
## 1st Qu.:0.449 1st Qu.:5.89 1st Qu.: 45.0 1st Qu.: 2.10
## Median :0.538 Median :6.21 Median : 77.5 Median : 3.21
## Mean :0.555 Mean :6.28 Mean : 68.6 Mean : 3.80
## 3rd Qu.:0.624 3rd Qu.:6.62 3rd Qu.: 94.1 3rd Qu.: 5.19
## Max. :0.871 Max. :8.78 Max. :100.0 Max. :12.13
## rad tax ptratio black lstat
## Min. : 1.00 Min. :187 Min. :12.6 Min. : 0 Min. : 1.7
## 1st Qu.: 4.00 1st Qu.:279 1st Qu.:17.4 1st Qu.:375 1st Qu.: 7.0
## Median : 5.00 Median :330 Median :19.1 Median :391 Median :11.4
## Mean : 9.55 Mean :408 Mean :18.5 Mean :357 Mean :12.7
## 3rd Qu.:24.00 3rd Qu.:666 3rd Qu.:20.2 3rd Qu.:396 3rd Qu.:17.0
## Max. :24.00 Max. :711 Max. :22.0 Max. :397 Max. :38.0
## medv
## Min. : 5.0
## 1st Qu.:17.0
## Median :21.2
## Mean :22.5
## 3rd Qu.:25.0
## Max. :50.0
## crim zn indus chas
## Min. :-0.42 Min. :-0.49 Min. :-1.556 Min. :-0.27
## 1st Qu.:-0.41 1st Qu.:-0.49 1st Qu.:-0.867 1st Qu.:-0.27
## Median :-0.39 Median :-0.49 Median :-0.211 Median :-0.27
## Mean : 0.00 Mean : 0.00 Mean : 0.000 Mean : 0.00
## 3rd Qu.: 0.01 3rd Qu.: 0.05 3rd Qu.: 1.015 3rd Qu.:-0.27
## Max. : 9.92 Max. : 3.80 Max. : 2.420 Max. : 3.66
## nox rm age dis
## Min. :-1.464 Min. :-3.88 Min. :-2.333 Min. :-1.27
## 1st Qu.:-0.912 1st Qu.:-0.57 1st Qu.:-0.837 1st Qu.:-0.80
## Median :-0.144 Median :-0.11 Median : 0.317 Median :-0.28
## Mean : 0.000 Mean : 0.00 Mean : 0.000 Mean : 0.00
## 3rd Qu.: 0.598 3rd Qu.: 0.48 3rd Qu.: 0.906 3rd Qu.: 0.66
## Max. : 2.730 Max. : 3.55 Max. : 1.116 Max. : 3.96
## rad tax ptratio black
## Min. :-0.982 Min. :-1.313 Min. :-2.705 Min. :-3.90
## 1st Qu.:-0.637 1st Qu.:-0.767 1st Qu.:-0.488 1st Qu.: 0.20
## Median :-0.522 Median :-0.464 Median : 0.275 Median : 0.38
## Mean : 0.000 Mean : 0.000 Mean : 0.000 Mean : 0.00
## 3rd Qu.: 1.660 3rd Qu.: 1.529 3rd Qu.: 0.806 3rd Qu.: 0.43
## Max. : 1.660 Max. : 1.796 Max. : 1.637 Max. : 0.44
## lstat medv
## Min. :-1.53 Min. :-1.906
## 1st Qu.:-0.80 1st Qu.:-0.599
## Median :-0.18 Median :-0.145
## Mean : 0.00 Mean : 0.000
## 3rd Qu.: 0.60 3rd Qu.: 0.268
## Max. : 3.55 Max. : 2.987
## [1] "matrix"
HOW DID THE VARIABLES CHANGE?
create categorical variable of crime rate (from scaled crime rate), quantiles as breakpoints:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.42 -0.41 -0.39 0.00 0.01 9.92
## 0% 25% 50% 75% 100%
## -0.41937 -0.41056 -0.39028 0.00739 9.92411
## crime
## low med_low med_high high
## 127 126 126 127
Division into train and test sets divide dataset into train and test so that 80% of data belongs to train set:
Linear discriminant analysis (LDA) fit linear discriminant analysis to train set (categorical crime rate as target variable & all others as predictor variables) + draw LDA (bi)plot:
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.262 0.248 0.240 0.250
##
## Group means:
## zn indus chas nox rm age dis rad tax
## low 0.974 -0.905 -0.0866 -0.887 0.4588 -0.887 0.884 -0.690 -0.741
## med_low -0.107 -0.307 0.0426 -0.567 -0.0952 -0.342 0.309 -0.548 -0.475
## med_high -0.388 0.187 0.2553 0.393 0.0778 0.447 -0.365 -0.453 -0.336
## high -0.487 1.017 -0.1164 1.036 -0.4532 0.831 -0.868 1.638 1.514
## ptratio black lstat medv
## low -0.466 0.3830 -0.7648 0.536
## med_low -0.121 0.3187 -0.1936 0.047
## med_high -0.267 0.0804 0.0429 0.148
## high 0.781 -0.6543 0.8986 -0.691
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.0630 0.7025 -1.0304
## indus 0.0329 -0.1325 0.2351
## chas -0.1013 -0.0895 0.0750
## nox 0.3605 -0.8075 -1.4696
## rm -0.1436 -0.0672 -0.1826
## age 0.1682 -0.3541 -0.1768
## dis -0.0245 -0.2903 -0.0200
## rad 3.6157 1.0492 -0.0716
## tax 0.1351 -0.0891 0.6960
## ptratio 0.1054 -0.0371 -0.4492
## black -0.1022 0.0710 0.1395
## lstat 0.3075 -0.3059 0.1675
## medv 0.2764 -0.4317 -0.2875
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9566 0.0333 0.0101
save crime categories from the test set:
Prediction predit the classes with LDA model on test data + crosstabulate with crime categories from the test set (= correct_classes):
## predicted
## correct low med_low med_high high
## low 11 10 0 0
## med_low 5 13 8 0
## med_high 1 6 19 3
## high 0 0 0 26
COMMENT ON RESULTS
K-means clustering reload Boston + standardize
calculate distances between observations:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.13 3.46 4.82 4.91 6.19 14.40
run k-means clustering [Euclidean]:
investigate optimal number of clusters & run k-means again: INTERPRET RESULTS
We were also given the code for producing a 3D plot. I did not have time to study it in detail, but I absolutely had to see what what it looks like, so here it is!
## [1] 404 13
## [1] 13 3
check colours (should be the crime classes of the train set)